[57] ABSTRACT

A computer system and computer implemented method 
automatically classify video sequences into categories. A set
of categories is defined either manually through the association
of selected video sequences with user supplied
category designations, or automatically through segregation 
of a set of video sequences into groups of similar sequences. 
Input video sequences are then classified by either pixel 
decomposition or primitive attribute decomposition; the 
former analyzing each image on a pixel basis, the latter 
employing extracted image information. Categories can be 
trained as new video sequences are input into the system, or 
new categories can be created to accommodate such new 
sequences that are dissimilar from existing categories. 

13 Claims, 7 Drawing Sheets 
METHOD AND SYSTEM FOR AUTOMATIC CLASSIFICATION OF VIDEO IMAGES

RELATED APPLICATION

[...] SEARCHING GRAPHIC IMAGES AND VIDEOS, filed on Sep. 30, 1994,
which is incorporated herein by reference.

BACKGROUND OF THE INVENTION

1. Field of the Invention

The invention relates to methods and systems for image analysis,
and more particularly, to methods and systems for automatically
classifying images into categories.

2. Description of the Background Art

There is currently a growing demand for computer systems that can
produce, edit, and manipulate graphic images, and in particular,
for systems that can create, edit, or manipulate digitized video
images. This demand is generated by several market influences and
consumer trends. There has been, and will continue to be, steady
growth in the multimedia marketplace for computer-based products
that integrate text, audio, graphics and increasingly video, for
educational, entertainment, and business purposes. Also, the use
of video for educational or business presentations and for
artistic or personal applications has become increasingly popular
as the costs of video production equipment have fallen. Products
ranging from video games to computerized encyclopedias to
computerized training guides now commonly employ digitized video
to entertain, educate, and instruct.

These consumer trends are matched by various technological
advances that have made widespread the use of video for computer
based applications. Equipment to digitize video at high speed and
quality has allowed software designers to integrate video into
commercial software products such as computer games, and has
allowed individual computer users to incorporate video into
business presentations or other similar projects. Improvements in
telecommunications and network technologies, such as increased
transfer rates, bandwidth, and the like, have made realistic the
opportunity for computer users of all types to access online
libraries of video with acceptable speed and quality.

The rise of desktop video production, including the development
of video compression standards such as MPEG, has reduced the cost
of video production systems, making pre- and post-production
systems accessible to more users and businesses. There are now
available a number of software products for multimedia authoring
that handle video, graphics, audio, and animation in the
development environment. Such technologies have been made possible
by increases in microprocessor power coupled with dramatic
reductions in cost. Personal computers now offer performance
previously found only in engineering workstations, or mainframes.

In addition to computation power and sophisticated software,
improvements in storage capacities and compression technologies
have increased the ability to store digitized video, which
typically requires large storage needs. Uncompressed NTSC quality
video requires 15 Mb per second for 30 fps video, or almost 1 Gb
for a minute's worth of video. The MPEG standard for video image
compression provides for a 40:1 compression ratio, allowing an
hour's video footage to be stored in about 1.3 Gb of storage
capacity. Compression also facilitates network access, and thus
the development of video libraries that allow users to select and
retrieve video footage in real, or near real time.

All these factors have produced a demand for systems and products
that aid the storage, identification, and retrieval of graphic
images and video. This is because designers of multimedia software
products, computer graphic artists, and even individual users,
often have extensive libraries of photographs, digitized video, or
other computer generated graphic images, for incorporating such
materials in multimedia products. Thus a designer may have
hundreds, or thousands, of photographs of people, animals, urban
settings, landscapes, sporting events, or any other category of
images, and may have hours of similarly diverse video footage, all
for creating multimedia presentations. Similarly, with the
emergence of desktop video production, video producers typically
develop extensive libraries of video for use by themselves, or
others, to aid in the creation of new works. Other businesses that
have existing libraries of video, and that generate large
quantities of video, such as television stations, film studios,
and the like, will eventually produce and store increasing
quantities of video using computers and mass storage devices.

To effectively use a library of images or video, the software
designer must be able to retrieve an image or video according to
certain visual attributes or characteristics in the image. For
example, the designer may need a single image or even video
footage of a sunset over an ocean shore for a given project, and
would need a way to locate that image from many other images,
without having to review many hours of video, or numerous
photographs that may or may not match the desired visual
characteristics of the image. In the past, such retrieval was
manually performed. For computer based image retrieval to be
useful, some type of image analysis and classification of the
visual characteristics of the images is necessary in order to
speed up the retrieval process and make computer based storage an
effective and efficient tool.

The visual attributes or statistical qualities of images have been
extensively researched, and there are many techniques for
determining various aspects of an image, such as density and
distribution of its colors, the presence and degree of motion
between two images, the presence and position of distinct objects,
and the like. However, most of these techniques have been
developed for use in two principal areas: compression techniques
for communicating or storing images and video, and pattern
recognition techniques for determining whether a particular image
matches a given reference, such as in industrial part inspection.

These various image analysis techniques have not previously been
used for classifying images. Rather, classifying images is
typically based on storing images in a database with descriptive
text annotations. The designer then searches by inputting a text
description of an image and attempting to locate images that have
a matching text description. There are numerous problems with
using this approach to classify images and video.

First, a human observer must view each image in the database. This
is an extremely time consuming process, especially in a database
that may contain thousands of images, and must be repeated for
each image added to the database. Second, during viewing, the
observer must decide which visual elements of an image are
significant in determining the proper classification of the image.
This subjective judgment may overlook various image details that
may later be part of image characteristics for which the user is
searching by reviewing a list of classifications. Thus the observer
may not note or describe specific objects in the background of an
image, or implicit elements of an image or video such as panning
or zooming. Even in still images, the
user may overlook significant colors, shapes, the presence of
persons, or other elements. As a result of these subjective
judgments, the observer's classification of the image may be
either too general (classifying an image of a sunset over the
beach as merely "Sun & Sky") or too specific ("Sunset on
The Strand"). When the classification is too general, many
dissimilar images will be included in the classification,
thereby diluting the value of the classification for
discriminating images. Where the classification is too narrow, too
few images will be included in later classifications, thus
increasing the number of distinct classifications that the user
must review in order to locate a desirable image.

Classifying video is even more difficult and time consuming.
In order to classify a video, an observer must view the
entire video, noting its various contents, such as different
scenes, and when each occurs, along with a description of
each scene and aspects significant for later retrieval. Again,
not every feature will be noted by the observer; this is an
even more significant problem for video since there is
typically more "content" to a video in terms of varying
images than a single photograph, and thus a single
classification of video is likely to be inadequately descriptive
of all of the content. None of these approaches use computer based
analysis of the images to classify a desired image.

Pattern recognition techniques have been previously used
to classify images with computers. These techniques have
usually been specialized to a particular field, for example
analysis of satellite imagery or component identification for
defect analysis. In addition, these techniques have typically
dealt only with still images, not video. Existing techniques
have generally hardwired the classification engine since only
a small number of known categories is typically of interest.
However, for general video analysis it is necessary to
provide more flexible classification methods and to allow
inclusion of time based features such as motion.

Accordingly, it is desirable to provide various methods for
classifying images according to their image attributes for
later retrieval. Where a user creates numerous images or is
constantly adding such images to an image database, automatic
classification of images should categorize new images
on the basis of various user supplied criteria. In addition, it
is desirable to provide for adaptive learning of the user's
classification of images based on image attributes in user
classified images.

SUMMARY OF THE INVENTION

The invention provides a method of automatically classifying
images and video sequences by developing a set of
categories, each category represented by a set of eigen
vectors and eigen values in a vector space. The vector space
can be defined by primitive attributes of the images, such as
color, texture, motion, luminance, and the like, or by a
generalized pixel decomposition. The eigen vectors representing
each category are determined from images or video
sequences that are designated as belonging to the category.
This set of images is used as a training set for the category.

A category is trained, and the eigen vectors are determined
for a category, by generating matrices of dot products
between each combination of images or video sequences in
the category. The highest energy eigen values and associated
vectors are extracted as the basis set for the category. The
images or video sequences in a given category can be determined
either by the user or automatically. The user determines
the images by selecting various images or video
sequences, as graphically represented on a display, and
designating them for inclusion in a particular category. The
system can automatically determine the categories by determining
a set of primitive attributes for each image or video
sequence, and then associating images or sequences having
similar primitive attribute values into distinct categories.

Once the eigen vectors are determined, then a new video
sequence or image can be classified by projecting the image
or frames of a video sequence onto the eigen vectors for each
category. A distortion value is measured, representing the
distance between the projected value of the image on the eigen
vectors, and the eigen values for all images or video
sequences in the category. Where the projection is done for
each primitive attribute, the distortion for an image is taken
as the weighted sum of the various individual distortions for
each primitive attribute. The distortion values for all
categories are compared, and the image or video sequence is
classified in the category having the lowest distortion value.
If the lowest distortion value exceeds a predetermined
threshold, then either a new category can be created for the
image or video sequence, or the category with the lowest
distortion can be retrained, including the new image in the
training set.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is an illustration of a system 10 for automatic
classification of video images;

FIG. 2 is a flowchart of the overall method of classifying
video images;

FIG. 3 is a flowchart of a method of categorizing video
images for classification;

FIG. 4 is a flowchart of a method for manually defining
categories for classification;

FIG. 5 is a flowchart of a method for automatically defining
categories for classification;

FIG. 6 is a flowchart of a method of automatically
classifying video sequences by orthogonal decomposition of
primitive attributes;

FIG. 7 is a flowchart of a method of automatically
classifying video sequences by orthogonal decomposition of
the pixel domain;

FIG. 8 is a flowchart of a method of transforming an
image or video sequence into a canonical space defined by
a set of eigen vectors.


DETAILED DESCRIPTION OF THE 
INVENTION 



Referring now to FIG. 1, there is shown one embodiment
of a system for automatically classifying images and video
sequences. The classification system 10 includes a processor
109 operatively coupled to a display 103, a pointing device
105, a keyboard 111, a mass storage device 107, and an
addressable memory 117. In the preferred embodiment the
processor 109 is from the 68000 family or PowerPC family
produced by Motorola, Inc., and used in the Macintosh™
brand of personal computers manufactured by Apple
Computer, Inc. The mass storage device 107 is for permanently
storing images, including graphic images, digitized
photographs and video sequences, including digitized (or
digitally produced) video images, or animations. The mass
storage device 107 is of conventional design, with sufficient
capacity, preferably in excess of 500 Mb, to store a large
number of digitized images or video sequences. The mass
storage device 107 may be a large capacity hard disk, a
CD-ROM, WORM, laser-disk, or other magnetic, optical, or
similar device for storing large volumes of digitized video.
The images may be stored in the mass storage device 107 in
an image database 113, or other suitable data storage for
easy retrieval and indexing. Images are input into the image
database 113 by digitizing them with the digitizer 101, or by
composing them in conventional graphic design or video
production applications. The display 103 is also of conventional
design and should have sufficient resolution to display
at least 640x480 pixels, preferably with at least 16 bit color
depth. The display 103 is also used to display a user interface
to the classification system 10, the user interface provided by
the user interface controller 125. The pointing device 105
may be a mouse, a stylus, a touch-sensitive screen, a voice
activated command processor, or a like device, for providing
inputs to the processor 109 via the user interface
controller 125, such as for controlling a cursor and other
elements provided in the user interface. A keyboard 111 is
also provided for inputting commands to the classification
system 10.

The addressable memory 117 stores a classification soft- 
ware application 119 that controls the processor 109 for 
effecting the methods of the present invention. The
classification application 119 includes a category trainer 121
which determines the eigen vectors representing each
category. The eigen vector generator 131 creates dot product
matrices for each category. The user interface controller 125
manages the display of the user interface and receives and 
interprets user commands input to the classification appli- 
cation 119. The image projector 123 projects an image or set 
of primitive attributes onto a set of eigen vectors during 
category classification. The best match detector 127 deter- 
mines the classification into a specific category by finding 
the category with a lowest distortion value. The space 
transformer 129 is used to transform an image or video 
frame into a canonical space of a given primitive attribute. 
The operation of these various code modules is further 
described below. 

This description will hereafter refer to the use of the 
system 10 to classify video sequences, which are comprised 
of a temporal series of related video frames, and it is 
understood the invention can be used to classify individual 
video images or frames. 

Referring to FIG. 2 there is shown a flowchart of the basic 
steps used to automatically classify video sequences accord- 
ing to the invention. First, the classification application is 
trained 100 for an initial set of categories C that will be used 
to classify video sequences V in the image database. The 
training establishes the individual categories C, and deter- 
mines the eigen values and eigen vectors that will define 
each category C. Once the categories C have been trained, 
the user inputs 200 a new video sequence into the system, or 
retrieves one from storage in the image database. The video 
sequence is then classified using one of the classification 
methods of the present invention. These include orthogonal 
decomposition of each video sequence using image 
attributes, orthogonal decomposition of each video sequence 
in the pixel domain, and a neural net based classification. 
Each of these operations is more completely described 
below. 

Category Definition 

In order to classify a set of video sequences V, a number 
of categories C must be previously developed, and this is 
accomplished by the category trainer 121 module. Referring 
to FIG. 3 there is shown a flowchart of the training process
managed by the category trainer 121.

First a number of categories C are defined 101 for
classification. Category definition can be performed either
manually by the user, or automatically by the system.
Referring to FIG. 4, when done manually, the user creates
101.1 a number of category labels to be used to segregate
any input video sequences. In a preferable user interface, the
categories C are iconographically represented as individual
folders with appropriate labels. In such a user interface, the
user can access categorized video sequences by conventionally
opening various folders. In an alternative user interface,
the category labels may be listed as a series of text items. In
addition to defining new categories, the user may import
101.2 an existing set of categories. To manually establish the
quantitative parameters of each category Cj, the user then
associates 101.3 any number of video sequences with a
selected category Cj. In a preferred user interface, this is
done by moving an icon representing each video sequence
into the desired category folder. For example, the user may
establish a category folder for video sequences containing
footage of horses, and would then move 101.4 icons representing
various different video sequences of such footage
into the category folder. In another user interface
embodiment, the user may designate a category, for example
by selecting a category name from a list, and then select or
"stamp" 101.5 a number of video sequences in order to
indicate their inclusion in the category. The steps defining
different categories and associating video sequences with
selected categories can be repeated as desired by the user.

The user may also decide to have the system automatically
determine the categories C that a given number of
video sequences V fall into. This is done as follows, as
shown in FIG. 5. First, the user designates 101.6 a set of
video sequences V to categorize. For each video sequence Vi
designated, the category trainer 121 generates 101.7 a set of
primitive attributes for the video sequence. The primitive
attributes are quantitative measures of various scalar and
complex feature sets of a video sequence. The primitive
attributes include an average binned color histogram of all
frames in the video sequence, average luminance or
intensity, Wold texture parameters, average motion vectors,
and the like. These primitive attributes can be used to
describe a video sequence as a vector in an orthogonal vector
space defined by the distinct primitive attribute types. Once
the primitive attributes of all of the video sequences have
been determined, the category trainer 121 then associates
101.8 video sequences with similar sets of primitive
attributes into distinct classes. The set of primitive attributes
for a particular video sequence can be thought of as a vector,
with the value of each primitive attribute being a component
of the vector. The vectors for all the sequences are clustered
using LBG (Linde-Buzo-Gray), or a similar vector analysis.
The result of this clustering is the association of similar
video sequences with a same centroid. Once all of the video
sequences have been segregated, the category trainer 121
then prompts 101.9 the user to input a name for each
category.
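
The clustering step can be illustrated with a minimal Python sketch of LBG-style clustering. It is an illustration under stated assumptions, not the patented implementation: the function names (`lbg_cluster`, `centroid`, `squared_distance`), the perturbation size `epsilon`, and the fixed number of refinement passes are all hypothetical choices, and each video sequence is assumed to be already reduced to a fixed-length list of primitive attribute values.

```python
def squared_distance(a, b):
    """Sum of squared component differences between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]

def lbg_cluster(vectors, splits=1, epsilon=0.01, iterations=20):
    """Cluster attribute vectors into 2**splits groups.

    Starts from the global centroid, splits every codeword into a
    perturbed pair, then refines the codebook by repeated
    nearest-centroid assignment (assumed parameters, see lead-in).
    """
    codebook = [centroid(vectors)]
    assignment = [0] * len(vectors)
    for _ in range(splits):
        # Split each codeword into a "plus" and a "minus" copy.
        codebook = [[x + sign * epsilon * (abs(x) + 1.0) for x in code]
                    for code in codebook for sign in (1, -1)]
        for _ in range(iterations):
            # Assign every vector to its nearest codeword.
            assignment = [min(range(len(codebook)),
                              key=lambda k: squared_distance(v, codebook[k]))
                          for v in vectors]
            # Move each codeword to the centroid of its members.
            for k in range(len(codebook)):
                members = [v for v, a in zip(vectors, assignment) if a == k]
                if members:
                    codebook[k] = centroid(members)
    return codebook, assignment

# Hypothetical attribute vectors for four video sequences.
attribute_vectors = [[0.1, 0.2], [0.15, 0.1], [0.9, 0.8], [0.85, 0.95]]
codebook, assignment = lbg_cluster(attribute_vectors, splits=1)
```

Sequences that share an entry in `assignment` share a centroid and would be presented to the user as one candidate category to name. The simple perturbation used here cannot separate data that differ only along directions orthogonal to it, so a production clusterer would need a more careful split.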

When the video sequences V have been segregated into
different categories C, either manually or automatically, the
category trainer 121 then calls the eigen generator 131 to
create a basis set of eigen values and eigen vectors for each
category using either pixel decomposition or primitive
attribute decomposition.

Category Training 
Referring again to FIG. 3, with pixel decomposition, the
category trainer 121 creates a basis set of eigen values and
eigen vectors representing the luminance and color
characteristics of video sequences in each category Cj. This is done
as follows. A given category C has Vn video sequences, n>1.
Each video sequence is a concatenation of frames. A video
sequence is then represented as a vector V=<F1, F2, F3, . . .
Fn>, where each F is a string of pixel values.

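For each video sequence Vi in category C, 1≤i≤n, the dot
product of Vi is taken 107a with every video sequence Vj,
producing 109a a Vn×Vn covariance matrix of dot products:

\[
\begin{bmatrix}
V_1 \cdot V_1 & V_1 \cdot V_2 & \cdots & V_1 \cdot V_n \\
V_2 \cdot V_1 & V_2 \cdot V_2 & \cdots & V_2 \cdot V_n \\
\vdots & \vdots & \ddots & \vdots \\
V_n \cdot V_1 & V_n \cdot V_2 & \cdots & V_n \cdot V_n
\end{bmatrix}
\]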

In the preferred embodiment preprocessing is used to 
remove repeated or redundant frames and consequently 
reduce the vector dimension of the covariance matrix. 
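
As a rough illustration of the matrix of dot products described above, the following Python sketch builds it for a category. The names `flatten_sequence` and `dot_product_matrix` are hypothetical, and the sketch assumes that preprocessing has already brought every sequence to the same number of frames and pixels so that the vectors have equal length.

```python
def flatten_sequence(frames):
    """Concatenate the frames of one sequence into a single pixel vector."""
    return [pixel for frame in frames for pixel in frame]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def dot_product_matrix(sequences):
    """Return the n x n matrix whose (i, j) entry is Vi . Vj."""
    vectors = [flatten_sequence(frames) for frames in sequences]
    return [[dot(vi, vj) for vj in vectors] for vi in vectors]

# Two toy sequences, each with two frames of two pixels (assumed data).
category = [
    [[1, 0], [0, 1]],
    [[1, 1], [1, 1]],
]
matrix = dot_product_matrix(category)
```

The result is symmetric and has one row and column per video sequence, so its size depends on the number of sequences in the category rather than on the number of pixels.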

Alternatively, the video sequences V can be decomposed 
using their primitive image attributes instead of the pixel 
values. In this case, for each video sequence Vi, the primitive 
attributes calculated above are retrieved, each video 
sequence Vi having primitive attributes Vi,m, where m 
indexes each different type of primitive attribute. Then for 
each primitive attribute Vi,m, the dot product is taken with
the primitive attribute Vj,m of each video sequence Vj, here 
producing a set of m covariance matrices of dot products, 
one for each primitive attribute type. Each category Cj thus 
has a set of m covariance matrices. 

In either case, either pixel decomposition or primitive 
attribute decomposition, the eigen generator 131 is called by 
the category trainer 121 to produce the appropriate matrices. 
Once the covariance matrix or matrices of dot products is 
generated, it is diagonalized by the eigen generator 131 to 
extract 111 a set of eigen values and eigen vectors repre- 
senting the category Cj. The eigen values are then ordered 
113 to determine the highest energy eigen values. The eigen 
vectors associated with the highest energy eigen values are 
retained 115 by finding the maximum eigenvalue emax, 
discarding eigenvalues ei (and associated eigenvectors) 
when ei/emax is less than a defined threshold. The threshold 
is preferably chosen to be 0.1 to form a basis set of eigen 
vectors for the category Cj. This process is repeated for each 
category C to be trained, and can also be repeated as desired 
by the user to retrain categories when new video sequences 
are added or removed from a category. 
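
The diagonalization and retention rule can be sketched as follows. The text does not say how the matrix is diagonalized; this sketch swaps in the classical Jacobi rotation method only because it needs no external library, and the names `jacobi_eigen` and `category_basis` are hypothetical. The default threshold of 0.1 is the preferred value stated above, and the largest eigen value is assumed to be positive.

```python
import math

def jacobi_eigen(matrix, sweeps=50, tol=1e-18):
    """Diagonalize a symmetric matrix with cyclic Jacobi rotations.

    Returns (values, vectors), where vectors[i] is the eigen vector
    paired with values[i].
    """
    n = len(matrix)
    a = [[float(x) for x in row] for row in matrix]
    v = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(sweeps):
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if a[p][q] == 0.0:
                    continue
                # Angle that zeroes the (p, q) entry.
                theta = 0.5 * math.atan2(2.0 * a[p][q], a[q][q] - a[p][p])
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):  # rotate columns p and q of A and V
                    a[k][p], a[k][q] = (c * a[k][p] - s * a[k][q],
                                        s * a[k][p] + c * a[k][q])
                    v[k][p], v[k][q] = (c * v[k][p] - s * v[k][q],
                                        s * v[k][p] + c * v[k][q])
                for k in range(n):  # rotate rows p and q of A
                    a[p][k], a[q][k] = (c * a[p][k] - s * a[q][k],
                                        s * a[p][k] + c * a[q][k])
    values = [a[i][i] for i in range(n)]
    vectors = [[v[k][i] for k in range(n)] for i in range(n)]
    return values, vectors

def category_basis(matrix, threshold=0.1):
    """Keep eigen pairs whose value is at least threshold * largest value."""
    values, vectors = jacobi_eigen(matrix)
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    e_max = values[order[0]]
    kept = [i for i in order if values[i] / e_max >= threshold]
    return [values[i] for i in kept], [vectors[i] for i in kept]

basis_values, basis_vectors = category_basis([[2.0, 2.0], [2.0, 4.0]])
```

Raising `threshold` shrinks the basis set, which lowers the cost of later projections at the price of a coarser description of the category.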

Automatic Classification 

After the categories have been initially trained, the user 
may then automatically classify new or additional video 
sequences. A new video sequence Vn is then input 200 into 
the system 10 from the video source, or by other means, 
including computer based generation using multimedia 
authoring tools. The user then designates the video sequence 
to be classified by selecting the video sequence Vn and 
issuing an appropriate command, such as retrieving a menu 
item, or similar means. The present invention provides for 
classification of the video sequence by several different 
methods, again including orthogonal decomposition of 
primitive attributes or decomposition of the pixel domain. 

Referring to FIG. 6 there is shown a flowchart for
automatically classifying video sequences by orthogonal
decomposition of primitive attributes. First, the set of
primitive attributes m is generated 301 for the new video sequence
Vn. This is done for each frame of the video sequence. Then
for each existing category C, each of the primitive attributes
m for each frame of Vn is compared with the corresponding
eigen vectors representing the primitive attributes of C.
This is done for each primitive attribute Vn,m in each frame,
by first transforming 307 the video sequence Vn into the
same canonical space as the eigen vectors for the category.
This transformation may include correction of scale,
rotation, normalization of the primitive attribute value,
spatial distortion, blurring, and the like. This transformation is
further described with respect to FIG. 8.
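
To make the per-frame attribute generation concrete, here is a small Python sketch that computes two of the primitive attributes named earlier in the description, a binned color histogram and an average luminance, for a single frame. The frame layout (a list of 8-bit RGB tuples), the bin count, the function names, and the use of the ITU-R BT.601 luma weights are assumptions made for illustration only; dividing the histogram by the pixel count is one simple form of attribute normalization.

```python
def binned_color_histogram(frame, bins=4):
    """Normalized RGB histogram with bins**3 cells.

    frame is a list of (r, g, b) tuples with components in 0..255.
    """
    histogram = [0.0] * (bins ** 3)
    for r, g, b in frame:
        index = (((r * bins) // 256) * bins * bins
                 + ((g * bins) // 256) * bins
                 + (b * bins) // 256)
        histogram[index] += 1.0
    return [count / len(frame) for count in histogram]

def average_luminance(frame):
    """Mean luma of the frame using BT.601 weights (assumed choice)."""
    return sum(0.299 * r + 0.587 * g + 0.114 * b
               for r, g, b in frame) / len(frame)

def frame_attributes(frame, bins=4):
    """Collect the primitive attributes of one frame, keyed by type."""
    return {
        "color_histogram": binned_color_histogram(frame, bins),
        "luminance": [average_luminance(frame)],
    }

# A toy four-pixel frame: two red pixels, one blue, one black.
toy_frame = [(255, 0, 0), (255, 0, 0), (0, 0, 255), (0, 0, 0)]
attributes = frame_attributes(toy_frame, bins=2)
```

Each dictionary value is a vector for one attribute type, which is the form expected when a frame is later projected onto the eigen vectors for that attribute.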

Once the video sequence Vn has been canonically transformed
307, it is projected 309 by the image projector 123
onto each matrix of eigen vectors that was generated for
each of the primitive attributes m. Each projection takes the
dot product of primitive attribute Vn,m and the corresponding
covariance matrix M for the category C. Projection
generates a vector, each component of the vector
corresponding to the projection onto an element of the eigen set.
The vector for each primitive attribute is then compared 311
to all the projections for that primitive attribute from all
video sequences V in category C. The comparison takes the
distance, or sum of squared differences of vector
components, between a primitive attribute vector m, and all
the vectors from the video sequences in C. This distance
gives an indication of how "close" a given video sequence
Vn is to other video sequences V in category C with respect
to the eigen vectors defining the primitive attribute m for the
category. As the projections for each primitive attribute are
done for each frame of a video sequence Vn, the minimum
distance between all frames of Vn and the set of projections
is taken 313 as the distortion for that primitive attribute
Vn,m with respect to category C.
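
The projection and comparison steps can be sketched in Python as below. This is one reading of the passage, not a definitive implementation: the eigen vectors are assumed to be expressed in the same space as the attribute vector, the minimum is taken over both the frames of Vn and the stored projections of the category, and the names `project` and `attribute_distortion` are hypothetical.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def project(vector, basis):
    """Project a vector onto each eigen vector of the basis set."""
    return [dot(vector, eigen_vector) for eigen_vector in basis]

def attribute_distortion(frame_vectors, basis, category_projections):
    """Distortion of one primitive attribute with respect to a category.

    frame_vectors: the attribute vector of every frame of the new sequence.
    category_projections: projections already stored for the category.
    Returns the smallest sum of squared differences found.
    """
    best = float("inf")
    for vector in frame_vectors:
        projected = project(vector, basis)
        for stored in category_projections:
            distance = sum((p - s) ** 2 for p, s in zip(projected, stored))
            best = min(best, distance)
    return best

# Toy data: an identity basis, two frames, two stored projections.
basis = [[1, 0], [0, 1]]
frames = [[1, 2], [3, 4]]
stored = [[3, 5], [10, 10]]
distortion = attribute_distortion(frames, basis, stored)
```

A small value means at least one frame of the new sequence lies close to some sequence already in the category for this attribute.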
This process of transformation, projection, and comparison
is repeated 305 for each primitive attribute m for the
covariance matrix of category C. This produces for each
category C a set of distortion values for the video sequence
Vn. The total distortion D for each category C is taken 315
as the weighted sum of the distortion values for all image
attributes Vn,m for that category:

\[
D_c = \sum_{k=1}^{m} w_k d_k
\]

where D_c is the total distortion for each category C for a new
video sequence Vn, w_k is the weighting for the k-th primitive
attribute, and d_k is the distortion value for video
sequence Vn with respect to the projections for the k-th primitive
attribute.

The total distortions Dc are then sorted by the best match
detector 127 and the category C with the lowest total distortion Dl
is determined 317 to be the appropriate classification of the
new video sequence Vn. In other words, the new video
sequence is most similar to, or least different from, other
video sequences in the selected category along each of the
orthogonal dimensions represented by the different primitive
attributes used to differentiate the categories.
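
A compact Python sketch of the weighted sum and best-match selection follows. The category names, weights, and threshold value are made-up illustration data, the function names are hypothetical, and the returned flag simply reports whether the lowest total distortion stays within a caller-supplied threshold.

```python
def total_distortion(distortions, weights):
    """Weighted sum of per-attribute distortions for one category."""
    return sum(w * d for w, d in zip(weights, distortions))

def best_match(per_category_distortions, weights, threshold):
    """Pick the category with the lowest total distortion.

    per_category_distortions maps a category name to its list of
    per-attribute distortion values. Returns (name, total, within),
    where within is False when the total exceeds the threshold.
    """
    totals = {name: total_distortion(values, weights)
              for name, values in per_category_distortions.items()}
    best = min(totals, key=totals.get)
    return best, totals[best], totals[best] <= threshold

# Hypothetical distortions for two categories and two attributes.
distortions = {"horses": [0.2, 0.4], "sunsets": [1.0, 0.1]}
name, lowest, within = best_match(distortions, weights=[0.5, 0.5],
                                  threshold=0.5)
```

When `within` is false, the caller can either create a new category or retrain the best-matching one with the new sequence.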

It may turn out that a new video sequence Vn does not
properly belong to any existing category. This arises when
the lowest total distortion Dl is greater 319 than a
predetermined threshold. The threshold can be adjusted by the user
to allow either broader categories (a higher threshold value)
or narrower, more precise categories (a lower threshold). If the
lowest total distortion Dl is greater than the threshold, then
the system 10 prompts 321a the user for a new category Cn,
and the system 10 will then add the video sequence Vn to the
new category Cn, and invoke the category trainer 121 to
develop the appropriate covariance matrices for the new
category. If the user does not wish to create a new category,
for example, because the category with the lowest total
distortion Dl is the category which the user wants the video
sequence Vn classified in, then the user may cause the
system 10 to retrain category C by calling the category
trainer 121 to retrain the category by including the new
video sequence Vn.

Classification of a new video sequence Vn may also be
done by orthogonal decomposition on the pixel domain.
FIG. 7 shows a flowchart for classifying video sequences in
this manner. As described above, when the category trainer
121 was used to train the categories using pixel
decomposition, there was created a single covariance matrix
of eigen vectors for the category. In order to classify, then,
for each category Ci, the video sequence is transformed
into the canonical space of the eigen vectors for the category
according to the method described with respect to FIG. 8.
Each frame of the video sequence Vn is then projected
onto the covariance matrix for the category Ci, that is,
projecting the pixel values of the frames of Vn onto the set
of eigenvectors for the category C. In the preferred
embodiment subsampling of the frames of the video
sequence and removal of repeated or redundant frames are
done prior to projection.

The projection is then compared 411, as above, with the
projections for all other video sequences in category Ci, to
produce a distortion value, as the sum of squared differences.
The minimum distortion value for all frames in the video
sequence Vn is taken 413 as the distortion for the video
sequence Vn in category Ci. This process is repeated 403 for
each category C.

The category C with the lowest total distortion Dl, as
defined above, is designated 417 as the classification of the
new video sequence Vn. Again, the system will test 419
whether Dl exceeds a defined threshold, and if so, provide
for either creating 421 a new category with the new video
[...]

[...] error. When the reconstruction error is minimized, then the
scale of the video frame has been correctly normalized. The
reconstruction error is minimized as follows.

After a first reconstruction error e1 is generated as
described (or using other equivalent difference measures),
the scale of the input image is adjusted 709 by upsampling
or downsampling the image by a predetermined amount. The
adjusted input image [...] vector [...] generate another set
of weights wi. Another reconstruction error e2 is then
determined by reconstructing [...] as described. [...] is
determined, indicating whether the reconstruction errors are
increasing or decreasing. If the reconstruction errors are
increasing, then the scaling is in the wrong direction, and
the direction of scaling (enlargement or reduction) is
reversed. If the reconstruction errors are decreasing, then
the scale is moving in the right direction, and the scale can
continue to be adjusted in the same direction. [...] the
reconstruction error is minimized, the video frame [...] for
classification.

The foregoing normalization method can be used for any [...]
represented by a set of eigen vectors Ei and a set of weights
wi, such as particular textures, colors, gradients, any
defined region of pixels, and the like, or any combination
thereof. In addition to normalization of scaling, the
reconstruction errors can be employed to normalize
translations, rotations, or other operations on the input
image.

The preferred embodiment of the invention has been

sequence Vn as its member, or retraining 423 the category described as computer based system employing particular 

^ software for configuring and operating the computer system. 

35 Alternatively, the invention may be embodied in application 

Image Transformation specific integrated circuitry, or in programmable logic 

Referring now to FIG. 8, there is shown a flowchart for a devices such as programmable logic arrays, digital signal 

method for transforming an image or video sequence into a processors, or the like. This would allow the invention to be 

canonical space defined by a set of eigen vectors. Transfer- incorporated in video storage and playback systems, such as 

mation into the canonical space ensures that the projection 40 dedicated video storage or playback systems, for example, 

of a video sequence onto the covariance matrices produces or systems based on CD or similar optical disk technology 

accurately representative vectors which represent that allow large volumes of digitized video to be captured, 

significant, rather the spurious, differences between the processed, and stored for subsequent retrieval, 

primitive attributes of the video sequence and the primitive We claim: 

attributes of the category. As one example of the type of 45 1. A method of automatically classifying a video sequence 

transformation that can be performed, scaling of an image into a category, the video sequence including at least one 

will be used. It is understood that other transformations, frame, comprising the steps of: 

such as luminance normalization, rotation, stretching, nor- creating a set of categories, each category representing a 

malization of color, or the like, can be similarly performed. set of video sequences having a set of similar primitive 

In order to normalize scaling in a particular input image 50 image attribute values orthogonally representing the 

for category classification, a set of eigen-vectors Ei and category, by: 

weights wi is provided 701 for a given primitive attribute, receiving a user specification of selected video 

here scale. Hie eigen vectors for the primitive attribute are sequences; 

determined from a predetermined sample image that has determining for each selected video sequence at least 

canonical values for the image attribute. For scale, there 55 one primitive image attribute value; 

would be different resolutions (sizes) of the sample image. segregating the video sequences into sets, each set of 

The eigen vectors for the sample image thus represent the video sequences having a set of similar primitive 

canonical space for the primitive attribute. image attribute values; 

A frame from the video sequence Vn being classified is defining each category by associating each of the sets 

projected 703 onto each eigen vector Ei, and a weight wi* 60 of video sequences with each category; 

is obtained for each eigen vector Ei. The frame is then for each category, creating a covariance matrix of dot 

reconstructed 705 forming reconstructed frame ! from all the products for each pair of the video sequences in the 

eigen-images Ei with the new weights wi*. The recon- category; and 

structed frame t is then compared 707 with the sample image determining a set of eigen vectors from the covariance 

to produce a reconstruction error by taking the sum of the 65 matrix as the set of similar primitive image attribute 

squared pixel to pixel difference between the sample image values of the category; 

and I. This reconstruction error is dependent on the scale receiving an input video sequence; 
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determining a distortion measure for each category with 
respect to the input video sequence by projecting the 
input video sequence onto the set of similar primitive 
image attribute values of each category; and 

classifying the input video sequence in the category 
having a minimum distortion measure. 
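
The steps recited in claim 1 can be illustrated with a minimal NumPy sketch. It assumes that each video sequence has already been reduced to a fixed-length vector of primitive attribute values, and it uses the sum of squared differences as the distortion measure. The function names train_category, distortion, and classify are hypothetical and are not taken from the specification.

```python
import numpy as np

def train_category(vectors):
    """Derive a category's eigen vectors from its member sequences.

    vectors: (n_sequences, n_features), one row per video sequence (assumed layout).
    Returns (basis, member_projections).
    """
    X = np.asarray(vectors, dtype=float)
    # Matrix of dot products for each pair of sequences in the category (n x n).
    C = X @ X.T
    # C is symmetric, so eigh applies; eigenvalues are returned in ascending order.
    evals, evecs = np.linalg.eigh(C)
    keep = evals > 1e-10 * evals.max()
    # Map eigen vectors of the small n x n matrix back into feature space.
    basis = X.T @ evecs[:, keep]
    basis /= np.linalg.norm(basis, axis=0)
    return basis, X @ basis

def distortion(basis, member_projections, vector):
    """Smallest sum of squared differences between the projected input
    and the projections of the category's member sequences."""
    p = np.asarray(vector, dtype=float) @ basis
    return ((member_projections - p) ** 2).sum(axis=1).min()

def classify(categories, vector):
    """categories: dict mapping name -> (basis, member_projections)."""
    scores = {name: distortion(b, m, vector) for name, (b, m) in categories.items()}
    best = min(scores, key=scores.get)
    return best, scores[best]
```

A caller would build the dictionary once with train_category for each category and then call classify for each input vector; the returned distortion can be compared against a threshold as in claims 5 and 6.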

2. The method of claim 1, wherein the step of creating a 
covariance matrix of dot products further comprises the 
steps of: 

for each primitive image attribute value, creating a cova- 
riance matrix of dot products for the primitive image 
attribute value for each pair of video sequences; and 

wherein the step of determining a set of eigen vectors 
further comprises the step of: 

determining from the covariance matrix for each primi- 
tive image attribute value a set of eigen vectors, the 
set of eigen vectors being the set of similar primitive 
image attribute values of the category. 

3. The method of claim 2, the step of determining a 
distortion measure, further comprising for each category, the 
steps of: 

determining for the input video sequence a set of primi- 
tive image attribute values; 

determining for each primitive image attribute value of 
the input video sequence a primitive image attribute 
vector comprising a dot product between the primitive 
image attribute value and the set of eigen vectors of the 
set of similar primitive image attribute values for the 
category; 

comparing each primitive image attribute eigen vector 
with the primitive image attribute vectors for each input 
video sequence associated with the category to produce 
a distortion measure for each primitive image attribute 
value of the input video sequence; 

determining for each primitive image attribute value of 
the input video sequence a distortion measure having a 
minimum value; and 

determining a total distortion measure for the category as 
a function of the minimum distortion measures for each 
primitive image attribute value of the input video 
sequence. 

4. The method of claim 1, the step of determining a
distortion measure, further comprising for each category, the
steps of:

determining for each frame of the input video sequence a 
frame vector comprising a dot product between the 
frame and the eigen vectors of the category; 

comparing each frame vector with frame vectors for each 
video sequence associated with the category to produce 
a distortion measure for each frame; and 

determining the distortion measure having a minimum 
value. 
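
The per-frame comparison of claim 4 might look like the following sketch. It assumes that each frame is a small pixel array flattened into a vector, that basis holds the category's eigen vectors as columns, and that the frame vectors of the category's existing sequences are stored row-wise. The name sequence_distortion and the array layouts are assumptions made for illustration.

```python
import numpy as np

def sequence_distortion(basis, category_frame_vectors, frames):
    """Return (minimum distortion, per-frame distortions) for one category.

    basis: (n_pixels, k) eigen vectors of the category (assumed layout).
    category_frame_vectors: (m, k) frame vectors already stored for the category.
    frames: sequence of equally sized pixel arrays from the input video sequence.
    """
    F = np.asarray(frames, dtype=float).reshape(len(frames), -1)
    # Frame vector: dot products between each frame and the eigen vectors.
    frame_vectors = F @ basis                                   # (f, k)
    stored = np.asarray(category_frame_vectors, dtype=float)    # (m, k)
    # Sum of squared differences between every input frame and every stored vector.
    diffs = frame_vectors[:, None, :] - stored[None, :, :]      # (f, m, k)
    per_pair = (diffs ** 2).sum(axis=2)                         # (f, m)
    per_frame = per_pair.min(axis=1)
    return per_frame.min(), per_frame
```

Broadcasting keeps the comparison in one array operation; for long sequences the (f, m, k) intermediate could be large, which is one reason subsampling frames before projection is attractive.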

5. The method of claim 1, further comprising the steps of: 
comparing the minimum distortion measure with a thresh- 
old; and 

if the minimum distortion measure exceeds the threshold, 
creating a new category including the input video 
sequence. 

6. The method of claim 1, further comprising the steps of:
comparing the minimum distortion measure with a threshold;

if the minimum distortion measure exceeds the threshold,
associating the input video sequence with the category
having minimum distortion measure;

recreating the covariance matrix of dot products for each
pair of video sequences associated with the category;
and,

re-determining the set of eigen vectors from the covariance
matrix as the set of similar primitive image
attribute values of the category. 
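
Claims 5 and 6 describe two alternative responses when the minimum distortion exceeds the threshold: create a new category, or add the sequence to the best category and retrain it. The sketch below shows one way that decision could be coded. The names eigen_basis and resolve, the dictionary of member vectors, and the new_name argument standing in for the user's choice are all hypothetical.

```python
import numpy as np

def eigen_basis(vectors):
    """Eigen vectors derived from the matrix of pairwise dot products."""
    X = np.asarray(vectors, dtype=float)
    evals, evecs = np.linalg.eigh(X @ X.T)
    keep = evals > 1e-10 * evals.max()
    basis = X.T @ evecs[:, keep]
    return basis / np.linalg.norm(basis, axis=0)

def resolve(members, best, min_distortion, vector, threshold, new_name=None):
    """Decide what to do with an input sequence after classification.

    members: dict mapping category name -> list of member vectors (mutated in place).
    Returns (category name, retrained basis or None).
    """
    if min_distortion <= threshold:
        # The classification stands and nothing is retrained.
        return best, None
    # Above the threshold: new category if a name was supplied (claim 5),
    # otherwise retrain the best category with the new sequence (claim 6).
    target = new_name if new_name is not None else best
    members.setdefault(target, []).append(np.asarray(vector, dtype=float))
    return target, eigen_basis(members[target])
```

Passing new_name corresponds to the user accepting the prompt for a new category; omitting it corresponds to retraining the existing category with the new sequence included.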

7. A computer system for automatically classifying an 
input video sequence including at least one frame into a 
category, comprising: 

a processing unit programmed to:

create a set of categories, each category representing a
set of video sequences having a set of similar primitive
image attribute values orthogonally representing
the category, the categories created by the processing
unit further programmed to:

receive a user specification of selected video
sequences;

determine for each selected video sequence at least
one primitive image attribute value;

segregate the video sequences into sets, each set of
video sequences having a set of similar primitive
image attribute values;

define each category by association of each of the
sets of video sequences with each category;

for each category, create a covariance matrix of dot
products for each pair of video sequences in the
category; and

determine a set of eigen vectors from the covariance
matrix as the set of similar primitive image
attribute values of the category;

receive the input video sequence;

determine a distortion measure for each category with
respect to the input video sequence by projecting the
input video sequence onto the set of similar primitive
image attribute values of each category; and

classify the input video sequence in the category having
a minimum distortion measure.

8. The computer system of claim 7, further comprising: 
a storage device, operatively coupled to the processing
unit, and storing thereon a plurality of video sequences,
a plurality of categories, each category defined by a set
of similar primitive image attribute values and having
a category designation;

a video input device operatively coupled to the storage
device, for receiving video sequences from a source
external to the computer system, and storing the video
sequences on the storage device; and

a display device, operatively coupled to the processing
unit, for displaying thereon selected portions of
selected video sequences, and further displaying category
designations; and

an electronic pointing device, responsive to user inputs, 
for associating a selected portion of a selected video 
sequence with a selected category designation. 

9. The computer system of claim 7, wherein the process- 
ing unit is further programmed to: 

for each primitive image attribute value, create a covari- 
ance matrix of dot products for the primitive image 
attribute value for each pair of video sequences; and, 

determine from the covariance matrix for each primitive 
image attribute value a set of eigen vectors, the sets of 
eigen vectors being the set of similar primitive image
attribute values of the category. 

10. The computer system of claim 9, wherein the pro- 
cessing unit is further programmed to: 

determine for each primitive image attribute value of the 
input video sequence a primitive image attribute vector 
as a dot product of the primitive image attribute values 
and the set of eigen vectors of the set of similar 
primitive image attribute values for the category;

produce a distortion measure for each primitive image
attribute value by comparing each primitive image 
attribute vector with primitive image attribute vectors 
for each video sequence associated with the category; 

determine for each primitive image attribute value a 
distortion measure having a minimum value; and 

determine a total distortion measure for the category as a 
function of the minimum distortion measures for each 
primitive image attribute value. 

11. The computer system of claim 7, wherein the pro- 
cessing unit is further programmed to: 

determine for each frame of the video sequence a frame 
vector as a dot product between the frame and the eigen 
vectors of the category; 

compare each frame vector with frame vectors for each
video sequence associated with the category to produce 
a distortion measure for each frame; and 

determine the distortion measure having a minimum 
value. 

12. The computer system of claim 7, wherein the processing
unit is further programmed to:

compare the minimum distortion measure with a threshold;
and

if the minimum distortion measure exceeds the threshold, 
create a new category and include the input video 
sequence. 

13. The computer system of claim 7, wherein the pro- 
cessing unit is further programmed to: 

compare the minimum distortion measure with a threshold;

if the minimum distortion measure exceeds the threshold, 
associate the input video sequence with the category 
having the minimum distortion measure; 

recreate the covariance matrix of dot products for each 
pair of video sequences associated with the category; 
and, 

re-determine the set of eigen vectors from the covariance 
matrix as the set of similar primitive image attribute 
values of the category.

*****

Patent are hereby corrected as shown below:

In the Claims:

Claim 8, Column 12, Line 40, please delete "We" and insert --the-- in its
Claim 8, Column 12, Line 46, please delete "flier" and insert --further-- in its